Indra Mani

Opening Remarks



Project Summary




Project Detail

STATUS: Open for discussions. (Please check my home page for the latest status.)



UPDATED: 17-Oct-2012

I've tried to put together all my productive activities in an as-is style. This makes it easier for employers to mix and match my skills with their requirements. I look forward to discussing assignments with prospective employers who require a set of my key skills.
Talking of my preferences, I would prefer -


Advisory Software Engineer at IBM India Software Lab, Hyderabad (Feb 2007 - till date)

Sr Staff Software Engineer at IBM India Software Lab, Delhi/Hyderabad (Feb 2006 - Feb 2007)

Technical Staff Member at IBM India Research Lab, Delhi (Feb 2004 - Feb 2006)

-- x -- -- x -- Recreational Break-- x -- -- x -- (Jan 2003 - Jan 2004)

Consultant at Siemens Information Systems Ltd, Delhi (Oct 2002 - Dec 2002)

Associate Consultant at Siemens Information Systems Ltd, Delhi (Oct 2000 - Sep 2002)

Sr. Software Engineer at Siemens Information Systems Ltd, Delhi (Oct 1999 - Sep 2000)

Software Engineer at Siemens Information Systems Ltd, Delhi (Feb 1999 - Sep 1999)

Freelance software consultant at HACE India Ltd Delhi (Jul 1997 - Feb 1999)

Research Assistant at IIT Delhi (Jul 1995 - Jul 1997)

Scientist at Torrent Research Center Ahmedabad (Jun 1994 - Jul 1995)

5 Yr. Integrated M. Tech. in Biochemical Engineering and Biotechnology, IIT Delhi (1989 - 1994)

Diplôme Supérieur de Langue Française, Alliance Française de PARIS (Jun 2004)

Oct 2011 - Oct 2012: Working as a member of Performance Engineering team on competitive analysis of emerging workloads for POWER7+ (Details...)

Apr 2011 - Mar 2012: Worked as a member of Performance Engineering team on microkernel based competitive analysis for POWER8 (Details...)

Oct 2010 - Oct 2012: Working as a member of Compiler Simdization team on performance analysis and tracking of IBM XL compiler for Blue Gene/Q supercomputer. (Details...)

May 2010 - Mar 2011: Worked as a member of Performance Engineering team on competitive analysis of spec2006 and WRF 3.x on POWER7 (Details...)

Jan 2009 - Apr 2010: Worked as a member of Performance Engineering team on performance analysis of simdized code for IBM XL compiler for POWER7 (Details...)

Oct 2008 - Apr 2009: Worked as a member of Compiler Simdization team on performance analysis of simdized code for IBM XL compiler for Blue Gene (Details...)

Jan 2006 - Apr 2009: Worked as a member of the BlueGene Compiler team on performance analysis and tracking of IBM XL compiler for Blue Gene supercomputer. (Details...)

Feb 2004 - Dec 2005: Worked as a member of High Performance Computing Group on benchmarking and optimization of Blue Gene supercomputer. (Details...)

Jan 2003 - Dec 2003: Worked on two exploratory projects: one related to microcontroller programming, the other to 3D modeling. (Details...)

Oct 2000 - Dec 2002: Worked as Project Manager for the development of the CallConnect 4.x Product Suite and collaborated in its on-site deployment, with a team of 5-9 people. (Details...)

Oct 1999 - Sep 2000: Worked as Technical Lead for CallConnect 3.0 and 4.0, and designed various interfaces and APIs. (Details...)

Feb 1999 - Sep 1999: Worked as a team member for CallConnect 3.0 and implemented certain modules. (Details...)

Jul 1997 - Feb 1999: Worked as a freelance software developer for HACE India Ltd. and developed a few UI products for their Data Acquisition System, collectively called Virtual Instruments. (Details...)

Jul 1995 - Jul 1997: Worked as a research assistant as part of the PhD program at IIT Delhi, on non-equilibrium thermodynamics applied to biological systems. (Details...)

Jun 1994 - Jul 1995: Worked as scientist at CADD (Computer Aided Drug Design) Unit, Torrent Research Center Ahmedabad and worked on identifying Angiotensin II antagonists. (Details...)

Pre 1994: I learnt programming in the very first year I entered IIT Delhi in 1989 and solved various assignments/problems, including my final year project, using PASCAL, FORTRAN, and C. (Details...)


Oct 2011 - Oct 2012: With IBM's focus on Cloud and Big Data computing, a set of emerging workloads was identified for competitive analysis. The objective of this exercise is to identify gaps in performance and, where possible, take corrective action in IBM hardware and compilers. This work is complementary to the competitive analysis of microkernels undertaken earlier. The set of emerging workloads includes unladen-swallow (Python), specjbb (Java), minebench 3.0, graph500, cloudsuite, the hadoop test suite, and dvdstore. I have been involved in the analysis of the graph500, minebench and dvdstore benchmarks/benchmark suites.

Apr 2011 - Mar 2012: This work was primarily intended to quantitatively evaluate the differences between standard assembly instruction sequences on IBM Power systems and those on competitors' systems. The instruction sequences analyzed included direct and indirect function call overhead, conditional branches, conditional moves, switch statements, crypto instructions, etc. The analysis used P7 hardware as well as the systemsim and M1 simulators. The results were provided as feedback to the Power hardware design team.
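To illustrate the kind of microkernel described above, here is a minimal sketch in C that times a direct call against an indirect call through a function pointer. It is only an assumption of what such a kernel could look like, not the code used in the actual analysis: the names add_one, run_direct and run_indirect are hypothetical, the noinline attribute is a GCC/Clang-style extension, and clock() gives only coarse CPU-time resolution.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical tiny callee, so that call overhead dominates the loop.
   noinline (a compiler extension) keeps the call from being folded away. */
__attribute__((noinline)) static uint64_t add_one(uint64_t x)
{
    return x + 1;
}

typedef uint64_t (*unary_fn)(uint64_t);

/* Calls add_one directly iters times; stores elapsed CPU seconds. */
uint64_t run_direct(uint64_t iters, double *elapsed)
{
    uint64_t acc = 0;
    clock_t t0 = clock();
    for (uint64_t i = 0; i < iters; i++)
        acc = add_one(acc);
    *elapsed = (double)(clock() - t0) / CLOCKS_PER_SEC;
    return acc;
}

/* Same loop, but through a function pointer. The pointer is volatile so
   the compiler must reload it and emit an indirect call each iteration. */
uint64_t run_indirect(uint64_t iters, double *elapsed)
{
    volatile unary_fn fp = add_one;
    uint64_t acc = 0;
    clock_t t0 = clock();
    for (uint64_t i = 0; i < iters; i++)
        acc = fp(acc);
    *elapsed = (double)(clock() - t0) / CLOCKS_PER_SEC;
    return acc;
}
```

Both functions return the accumulated value so that the loops cannot be discarded as dead code. Dividing the difference of the two elapsed times by the iteration count gives a rough per-call cost of the indirection; hardware cycle counters would be the more precise choice on a real system.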

Oct 2010 - Oct 2012: With the arrival of early BGQ hardware, I started working on performance analysis of the Sequoia benchmarks for BGQ. In the beginning, the work included analysis of simdization on BGQ and of the OpenMP performance of the Sequoia benchmarks on BGQ. Later, single-thread analysis of NAS-SER, Mira microkernels and basic loops, and OpenMP scaling studies of NAS-OMP, were performed. The STAMP benchmark was also analyzed to study the Transactional Memory implementation in BGQ hardware and software. I also tracked the performance of the evolving BGQ compiler, culminating in the GA release in May 2012. Currently I am doing regression analysis for the November PTF2012 compiler candidates for BGQ. During this period, I have also done regression analysis of several BGP PTF candidate compilers.

May 2010 - Mar 2011: I worked on competitive analysis of the specfp2006 and specint2006 benchmarks to identify common issues with IBM POWER systems. Later, a similar analysis was done for WRF 3.x, a weather modeling application. This led to the identification of some gaps in a library, which were fixed.

Jan 2009 - Apr 2010: In Jan 2009, I started working on POWER7 simdization issues and learned to use the P7 simulation tools. While working on simple loops like copy and daxpy to understand simdization on POWER7, I found some scheduling-related anomalies. The near-term objective of the POWER7 simdization analysis was to get a simdization-induced performance improvement on P7 similar to what I had demonstrated on BlueGene/P. I devised an incremental strategy for enabling VSX simdization for P7. The analysis progressively included basic loops, NAS applications, specfp2000 and specfp2006. It culminated in the release of the P7/AIX GA compiler around Apr 2010.

Oct 2008 - Apr 2009: After porting specFP2006 to Blue Gene/P in early 2008, I started collecting performance and profiling data for these applications and analyzed it for performance improvements due to simdization. Two benchmarks (lbm and bwaves) showed considerable improvement (~15% when simdization was enabled), but analysis showed that it was not caused by simdized calculations. All other benchmarks showed either degradation or no improvement. I worked to discover missed simdization opportunities in these benchmarks and demonstrated improvements due to simdization for the following: lbm (25% improvement), milc (7% improvement), leslie3D (5% improvement), cactusADM (3.5% improvement). For lbm, the technique was to reorder elements during the packing of a structure so that a pair of calculations could be transformed into a 16-byte-aligned complex calculation. For milc, it was done by establishing at compile time that the parameters to functions doing 16-byte-aligned complex calculations are disjoint (80% of the time they are) and using versioning otherwise. For leslie3D, it was done by padding temp arrays at the beginning so that the relative alignment of global and local arrays is the same during calculations. For cactusADM, the missed opportunity was shown by hand-coding a loop that took 7% of the computation time and showing that this leads to an overall 3.5% improvement in the application; there are similar calculations in another huge loop that takes 90% of the time but is tough to hand-code.
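The versioning idea mentioned for milc can be sketched in C as follows. This is an illustrative assumption, not the actual milc change or the XL compiler's transformation: the type cplx and the functions cmul, cmul_fast and cmul_safe are hypothetical names. A runtime check selects a version whose restrict-qualified parameters tell the compiler the arrays are disjoint (and so may be simdized), and falls back to a conservative version otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { double re, im; } cplx;   /* 16 bytes per element */

/* Conservative version: correct even when out aliases a or b. */
static void cmul_safe(cplx *out, const cplx *a, const cplx *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double ar = a[i].re, ai = a[i].im, br = b[i].re, bi = b[i].im;
        out[i].re = ar * br - ai * bi;
        out[i].im = ar * bi + ai * br;
    }
}

/* Fast version: restrict promises that out does not overlap the inputs,
   which leaves the compiler free to vectorize the loop. */
static void cmul_fast(cplx *restrict out, const cplx *restrict a,
                      const cplx *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double ar = a[i].re, ai = a[i].im, br = b[i].re, bi = b[i].im;
        out[i].re = ar * br - ai * bi;
        out[i].im = ar * bi + ai * br;
    }
}

/* True if the byte ranges [p, p+bytes) and [q, q+bytes) intersect. */
static int ranges_overlap(const void *p, const void *q, size_t bytes)
{
    uintptr_t a = (uintptr_t)p, b = (uintptr_t)q;
    return a < b + bytes && b < a + bytes;
}

static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Versioned entry point. Returns 1 if the fast version ran, 0 otherwise. */
int cmul(cplx *out, const cplx *a, const cplx *b, size_t n)
{
    size_t bytes = n * sizeof(cplx);
    if (!ranges_overlap(out, a, bytes) && !ranges_overlap(out, b, bytes) &&
        is_aligned16(out) && is_aligned16(a) && is_aligned16(b)) {
        cmul_fast(out, a, b, n);
        return 1;
    }
    cmul_safe(out, a, b, n);
    return 0;
}
```

Both versions compute the same result; only the guarantees given to the compiler differ. An in-place call such as cmul(a, a, b, n) takes the conservative path.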

Jan 2006 - Apr 2009: After we published HPCC results in 2005 and Blue Gene/L had established itself as the fastest computer, I started working with the BlueGene compiler team as a performance analyst. During this period I added the specFP2000, NPB3.2-SER, NPB3.2-OMP and specFP2006 benchmark suites to the BG compiler performance bucket for BlueGene/L as well as BlueGene/P. I also extracted a number of critical loops/code fragments and added them to the microkernel suite. This helped shorten the development cycle for compiler performance improvements. As of now the microkernel suite consists of 20 such microkernels, some of which are taken from customer applications. The work included periodic measurement and analysis of development or release candidate compilers for regressions with respect to the GA compiler, as well as for anomalies due to simdization. Anomalies found were analyzed to isolate the responsible component (TPO/TOBEY) and the optimization causing them, and were then filed as CMVC defects along with the analysis. I also worked on several customer requests to analyze performance, and performance degradation due to simdization, in their applications.

Feb 2004 - Dec 2005: I worked on benchmarking Blue Gene, which was the #1 supercomputer during this period, and on optimizing its performance. The benchmarking included benchmarks such as SPEC, HPCC, and MM5. SPEC is a benchmark suite of 26 common applications (gcc, gzip, etc.) drawn from a variety of domains to indicate single-CPU power. HPCC is a group of programs for studying parallel algorithms that rely not only on CPU power but also on communication between CPUs; examples are matrix transpose and multiplication of HUGE matrices that cannot be allocated in a single CPU's memory. MM5 is part of a weather forecasting system that primarily implements some sort of FEM. Optimization is done primarily in three steps -

  1. Identifying the pieces of code (blocks) that take the major chunk of time on the representative datasets. This is done by instrumenting the code, using the compiler (profiling), libraries (papi, PCL, perfmon, etc.), or timers (reading various performance counters using native implementation, system cycle counters).

  2. Taking appropriate action to reduce the time taken by these identified blocks. This is done by choosing appropriate compiler flags that improve performance of these blocks, reorganizing the code of these blocks in the higher-level language, rewriting the code of these blocks in the higher-level language, rewriting a part of these blocks in intrinsics/assembly, or rewriting these blocks completely in intrinsics/assembly, strictly in the order they are written here.

  3. Identification of communication patterns within distributed programs using MPI profiling libraries and designing MPI rank=>physical node mappings to minimize communication paths taken by most frequent messages.
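As a rough illustration of step 3, the sketch below scores a rank-to-node mapping by the total number of network hops its messages would travel. It assumes a 3D torus interconnect (as on Blue Gene/L) and a traffic matrix already collected with an MPI profiling library; the names coord, torus_hops and mapping_cost are hypothetical, and this is not the tooling actually used.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int x, y, z; } coord;

/* Shortest distance between two positions on a ring of the given size,
   allowing for the wrap-around link. */
static int ring_dist(int a, int b, int size)
{
    int d = abs(a - b);
    return d < size - d ? d : size - d;
}

/* Minimum hop count between two nodes of a torus with dimensions dims. */
int torus_hops(coord a, coord b, coord dims)
{
    return ring_dist(a.x, b.x, dims.x) +
           ring_dist(a.y, b.y, dims.y) +
           ring_dist(a.z, b.z, dims.z);
}

/* traffic[i * n + j] is the number of messages sent from rank i to rank j;
   map[r] is the node coordinate assigned to rank r. The cost is the sum of
   message count times hop count over all rank pairs. */
long mapping_cost(const long *traffic, const coord *map, int n, coord dims)
{
    long cost = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            cost += traffic[i * n + j] * torus_hops(map[i], map[j], dims);
    return cost;
}
```

Comparing mapping_cost for candidate mappings shows which one keeps the most frequent messages on the shortest paths. For example, if rank 0 sends mostly to rank 2, a mapping that places those two ranks on adjacent nodes scores lower than the default linear placement.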

Jan 2003 - Dec 2003: After years of nonstop work, practically without breaks, 2003 was a year of rejuvenation for me. However, I undertook two small exploratory projects that are worth mentioning. The first was related to microcontroller programming, the other to 3D modeling.

Equipped with a multimeter and a soldering iron, I assembled the hardware of an 8051 programmer kit. The programmer could be used to program an Atmel AT89C51 microcontroller through the parallel port of the PC. The assembly code (.ASM) of a test program to blink an LED was converted to its HEX (.HEX) representation using the Borland TASM assembler. This was then transferred to the 89C51 through the parallel port using the software component of the programmer kit on the PC. I also assembled a development and debugging platform kit, which itself uses an 89C51 microcontroller and has a keyboard, LED display, RS232 interface and interfaces to various IO ports. The monitor (driver) program was assembled on the PC using tasm, and a microcontroller was programmed with it using the above-mentioned programmer kit. This programmed microcontroller was then put into the development kit. The kit can be used to enter assembled HEX code, view and modify internal registers and memory locations, and run a particular piece of code. The entered data can also be stored persistently between sessions. (See Electronics For You, Jan 2003 and Feb 2003.)

During the second half of 2003 I explored various 3D-modeling, animation and video editing techniques using Blender, povray, YafRay, VirtualDub, Broadcast2000 and Cinelerra. An animation sample was dubbed with music and a VCD was made using Blender, mjpeg tools, vcdimager and vcdgear on Linux; it was tested successfully on a VCD player/TV combination. Similarly, a 15 sec animated birthday greeting was created. I also tried to record video from VHS cassettes and convert it to VCD format, using a PixelView card (Bt878 chipset), NuppelVideo, exportvideo and mjpeg tools. The quality of the recording was fine, but after multiplexing the video and audio streams into mpg they went out of sync somewhere in between.

Oct 2000 - Dec 2002: Worked as Project Manager for CallConnect 4.x with a team size varying between 5 and 9 people. CallConnect is Siemens Information Systems Ltd's product for implementing call centers. The product suite consists of a server application along with a C-API, ActiveX components, a client template in VB, an outbound campaign manager server, a campaign generator and softphone, a message player, and web-based administration and supervision. Besides these, I was also administering the GNATS bug tracking tool for various projects within the company. I developed a web interface for administering GNATS databases using Perl CGI scripts, accessed through the Apache web server on Sun Solaris.

Oct 1999 - Sep 2000: Worked as Technical Lead for CallConnect 3.0 and 4.0. During this period I was involved in developing a protocol for synchronized voice and data transfer between the Telemaster IVR and the CallConnect system. It was during this time that we acquired TADIRAN and HICOM PBXs and ported our code to work with them. We also incorporated CTConnect's API into our product and designed and implemented the C-API for our server.

Feb 1999 - Sep 1999: Worked as a team member for CallConnect 3.0. Here I implemented synchronized voice and data transfer during transfer and conference operations.

Jul 1997 - Feb 1999: Worked as a freelance software developer for HACE India Ltd. I developed a few products collectively called Virtual Instruments. These systems were Data Acquisition Devices (DAD) with analog-to-digital converters on the serial port. All of them shared the same hardware; different products were packaged using different rules for data transformation, storage and display. A few of these were named Data Logger, X-Y recorder, Weather Station, and Single and Dual Met. Parameter Recorder. I developed these VIs in C++ using the Borland C++ 4.0/5.0 IDE for Windows 9x and the Windows SDK. The work built on an existing Windows 3.1 SDK code base. The acquired data were transformed using configurable formulas, stored in proprietary flat-file formats and displayed in various modes: text, bar, dial, graph. All the graphics code was developed and encapsulated as a reusable library.

Jul 1995 - Jul 1997: Worked as a research assistant as part of the PhD program at IIT Delhi. The project I worked on was related to non-equilibrium thermodynamics applied to biological systems. It was a computational/theoretical project and involved mathematical modeling and simulations. I used Turbo Pascal 7.0 and Turbo C++ 1.0 to implement some models. As part of the assistantship, I worked on simulation studies of Packed Bed Reactors, which were implemented using C++/BC++2.0. During this time I also attended a one-semester course on Mathematical Foundations of Computer Science and learnt about Turing machines, computability, etc.

Jun 1994 - Jul 1995: Worked as a scientist at the CADD (Computer Aided Drug Design) Unit, Torrent Research Center Ahmedabad. I worked on identifying Angiotensin II antagonists. Angiotensin II is a protein fragment of eight amino acids, which is known to increase blood pressure in the human body. An antagonist is a substance with properties similar to Angiotensin II but whose effect on BP is neutral. The work required energy minimization studies of structures in dynamic simulations. The Cerius II drug discovery Workbench was the main tool, along with QUANTA and CHARMM on HP 9000 systems. I often wrote and used shell scripts and C programs to transform data generated by the above-mentioned tools into the desired format.

Pre 1994: I learnt programming in the very first year I entered IIT Delhi in 1989. It was a PASCAL/TurboPascal5.0/DOS4.0/286 system. I did a course on Data Structures and learnt C during the second year. Towards the final years, I had to learn FORTRAN to use a simulation package library written in FORTRAN. My final year project was modeling and simulation of charge flow within an ion channel. I used PASCAL to implement this project.

About CallConnect Product Suite: The CallConnect server is a multithreaded application written in C++ for Windows NT 4.0 using MSVC 5.0/6.0. It receives telephony events using Novell's TSAPI or Dialogic's CTConnect API. Based on these events, it manages agent/call states and statistics and can log these to MS Access 97, Oracle 8.0 or MS SQL 7.0 using ODBC. It uses UDP/IP sockets to communicate with agent desktop applications. A synchronized screen pop is delivered to the agent desktop during a transfer from one agent to another and during a conference with another agent. A protocol has been devised and implemented for synchronized voice/data transfer to/from SISL's Telemaster IVR.

The web UI has been developed as a Java applet/servlet, while the C-API has been enhanced to provide statistical information as well. A message player component was added, which uses Dialogic's analog voice cards to let the call center supervisor define voice messages to be played to a customer waiting in the queue. A whole range of basic and advanced routing schemes and priority queuing was added during this period. List of supported features can be viewed here

We were working on incorporation of Call Blending, multimedia support and multi-site support in the CallConnect 4.X when I left the project.

About CallConnect Product Deployments: We first deployed our product at Grameen Phone Dhaka (Bangladesh) in Sep 2001. During my association with the project, we deployed it at 5 more sites (BSNL Karnal, BSNL Gurgaon, BSNL Hyderabad, BTNL Bhopal, SBS Chilworth (UK)). My group developed agent screens and customized reports for the Grameen Phone, BTNL Bhopal and SBS Chilworth implementations.

Natural Languages: Hindi, English, French, Bangla, Gujarati, Urdu, Punjabi

IBM, Siemens, HACE, IIT Delhi, Torrent

Domains: Performance Engineering, Computer Architecture, Compiler Optimizations, High Performance Computing, CTI-Call Center, DAD-Virtual Instruments, Biology-Thermodynamics, Computer Aided Drug Design

Roles: Project Manager, Technical Lead, Team Member, Freelance Consultant, Researcher, Scientist

Computer Languages: C, C++, Perl, HTML, SQL, VB, VB Script, Java, PASCAL, FORTRAN

Systems: POWER7, Blue Gene, Intel/Red Hat Linux 6.0/7.1/8.0/9.0, Intel/Windows NT 4.0, Intel/Windows 9x, Intel/Novell NetWare 4.11, Intel/MS DOS 6.0, HP9000/HP-UX, Sparc/Sun Solaris, ICL 2900/3600

*nix Tools: ssh, screen, vim, gcc, make, cvs, perl, python, sed, awk, gdb, gnuplot, tprof, pmcount, perf

Compilers/IDE: gcc 3.x/4.x, xlc 6.0/7.0/8.0/9.0/10.0/11.0/12.0, xlf 8.1/9.1/10.1/11.1/12.1/13.1/14.1, Visual C++ 6.0, Borland C++ 5.0, Visual Basic 6.0, Turbo Pascal 7.0, Turbo C++ 1.0

Design Tools: Rational Rose 98

Software Packages: Cygnus 3.110, Active Perl 5.0, Cerius II Drug discovery Workbench, QUANTA, CHARMM

Administration: TikiWiki 1.8.3, GNU GNATS bug tracking system, Apache web server, Red Hat Linux, MS VSS 5.0

API: MPICH2, LAM/MPI2, PAPI, perfmon, Novell TSAPI, Dialogic CTConnect, Dialogic SDK 5.0, MS MAPI

Technologies: Blue Gene, MPI, Windows SDK Programming, MFC, COM, ActiveX, Sockets, ODBC, Crystal Reports, Servlet/Applet, CORBA

DBMS: MySQL 4, MS Access 97, Oracle 8.0, MS SQL 7.0, FoxPro 5.0

PBX: Coral TADIRAN, Siemens HICOM 300, Siemens HICOM 150, Realitis DX, Alcatel 4400, Ericsson MD110, Novell Definity G3 Simulator


IBM tools: Systemsim, M1